-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
extend genes in complete genomes to start/stop codons #5
Conversation
bdb2aa0
to
ec569a9
Compare
It would seem a simple T replaced by an A can have consequences. The main difference remaining between the FGSrs and Prodigal output is now the exclusion of the reverse stop codon in the reverse strands by FGSrs, where Prodigal includes this. Which would be the expected behaviour? |
For me, the expected behaviour is uniform treatment of stop codons for proteins on forward and reverse strands. |
I agree with this. Now FGSrs is inconsistent for + and - in terms of include stop/start codon or not compare to prodigal Jianshu |
a real world application using gene prediction: mapping transriptomic reads(pair ends sequencing)to each gene to see whether there are reads mapped to antisense(reverse strand of gene strand)showed that for reverse reads,the proportion of them mapped to the reverse strand of gene strand is a lot higher than forward reads for FGSrs but ver close for prodigal. So I would think now FGSrs reverse gene position is not so perfect. Many thanks! Jianshu |
NTO_MAG_1_prodigal.new.gff.zip Hello, I tried the newest fix 02cc0fe. Now start and stop codons look good. But for some genes that are not consistent, FGSrs is always short than prodigal (large start position, same stop position). Others looks good. Jianshu |
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 1917354 383036 508737 856214 83.35 69.13 57.05 37.27 23.23
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 1917354 383036 508737 856214 83.35 69.13 57.05 37.27 23.23
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2601595 527580 367165 169001 83.14 93.9 41.04 68.48 42.47
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2650006 517695 371428 126212 83.66 95.45 41.77 74.64 46.59
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2652412 517860 371290 123779 83.67 95.54 41.76 75.0 46.78
Ratings: tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2661082 518282 368500 117477 83.7 95.77 41.55 75.83 47.14
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2661082 518282 368500 117477 83.7 95.77 41.55 75.83 47.14
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2661082 518282 368500 117477 83.7 95.77 41.55 75.83 47.14
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2504714 280484 607613 272530 89.93 90.19 68.42 69.04 58.78
02cc0fe
to
22f2f20
Compare
tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58
d5c451a
to
2a3ebcd
Compare
Hello This time it improves by a lot. The ffn and faa output positions in the head is consistent with gff output but stop and start codon still not included in ffn and faa sequence. Jianshu |
More than that. The ffn and faa sequences (head is correct) output is not consistent with the prodigal fna and faa output even though the gff output is the same with prodigal gff output (output ffn part not fixed according to new gene positions). Everything else looks good. Many thanks for all those fixes and I think this should be the last one then FGSrs should be almost consistent with prodigal. Thanks again and good luck with the manuscript! You could try bmc bioinformatics. If you would like any testing and comparisons with prodigal/metaGeneMark,I will be very happy to test and by the way Mark Borodosky is my co-advisor here at Georgia tech. Any thing related to model improvement, he may be the one to give some advice. |
And that last commit should finish the PR. You might have noted the scripts in the
Seems like in quality, prodigal still outperforms FragGeneScan(Rs). On the other hand, FGSrs is 37.3 times faster for the used dataset (see meta/evaluation/README.md). We weren't really working on model improvement for now, just speed and code quality while replicating FGS behaviour, but this might be interesting in the future. |
@jianshu93 I'd be interested in a final review from you, too, but I can't seem to request it via the interface. |
Hello Felix, I had a look at the evaluation and your benchmarking. I do not see anything wrong this time. My testings are nice considering output. I think the genome you are using are complete genomes right? prodigal actually use a different model for complete genomes. The reason why I use (-p meta) in prodigal is that my genome is not complete genome (there are gaps, not circular because they are assembly with short reads from metagenomes, not cultivated single genome). I think you would also want to try no meta to compare even though the complete model also works for contigs/assemblies (not circular) according to the original FGS description. Prodigal is a little bit different for complete and meta mode. See the paper here: https://academic.oup.com/bioinformatics/article/28/17/2223/246063?login=true I think in the future FGSrs should also differentiate between complete genome and genome with gaps or assemblies. Jianshu |
Hello Felix, when all those commits are merged, please also create a condo recipe for FGSrs. A lot of softwares like the Metagenomics binning software MaxBin are now using FGS, which is very slow for large metagenomes. This could be a greate improvement Jianshu |
Hello Felix, Another error I notice: thread '' panicked at 'attempt to divide by zero', src/dna.rs:644:21 For some contig (less than 1000 bp, can reach 400 bp) run with compete mode there will always be this error. Any idea why? Thanks, Jianshu |
That last commit should fix the 'divide by zero'-panick: I was dividing by the length of the sequence without checking whether it was empty. I've also put adding FGSrs to bioconda on the planning. It probably shouldn't be hard, since I can just follow the FGS recipe. Could you perhaps suggest a good contig/assembly for running this test on? Many aren't properly annotated, and without annotations I can't make the same evaluation. (Unless I assume prodigal is always correct, and compare FGSrs to prodigal, not to the annotations.) |
87cef70
to
437ef43
Compare
Hello Felix, Sorry for such late response. I ran a test and this version is ok. I am wondering whether we can implement a fast AAI (average amino acid identity) module following FGSrs. which is to find best AA gene of one genome in another genome and also reverse. If the 2 are reciprocal best hit and then we keep this pair and do it for all genes in one genome and another genome and keep all reciprocal best hits and calculate average AAI based on all those genes. Seems search will take some time for each gene, diamond blastp could be better than we implement local alignment ourselves based on rust-bio. do you have better idea if we do not rely on external database search software such as diamond to calculate AAI? Many thanks, Jianshu |
I'm afraid I'm currently finishing my PhD, so I don't really have time to look into tangents at the moment. This does sound like an interesting idea though. |
Hello Felix, Good luck with you Ph.D! It's glad to hear that you are finishing it! I still have 3 years to go! Anyhow, I will implement it myself, possibly rely on blast+ suite to do best hit searching. Jianshu |
Hello Felix, Can you also add a support for gzipped fasta file as input? Should be very easy to implement? Or is is already supported? Think about when we have huge number of genomes downloaded from NCBI, which are all in *.fna.gz format. Thanks, Jianshu |
Fixes #3